70 research outputs found

    Hardware schemes for early register release

    Get PDF
    Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is quite related to the size and number of ports of the register file. In conventional register renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.Peer ReviewedPostprint (published version

    A comparison of cache hierarchies for SMT processors

    Get PDF
    In the multithread and multicore era, programs are forced to share part of the processor structures. On one hand, the state of the art in multithreading describes how efficiently manage and distribute inner resources such as reorder buffer or issue windows. On the other hand, there is a substantial body of works focused on outer resources, mainly on how to effectively share last level caches in multicores. Between these ends, first and second level caches have remained apart even if they are shared in most commercial multithreaded processors. This work analyzes multiprogrammed workloads as the worst-case scenario for cache sharing among threads. In order to obtain representative results, we present a sampling-based methodology that for multiple metrics such as STP, ANTT, IPC throughput, or fairness, reduces simulation time up to 4 orders of magnitude when running 8-thread workloads with an error lower than 3% and a confidence level of 97%. With the above mentioned methodology, we compare several state-of-the-art cache hierarchies, and observe that Light NUCA provides performance benefits in SMT processors regardless the organization of the last level cache. Most importantly, Light NUCA gains are consistent across the entire number of simulated threads, from one to eight.Peer ReviewedPostprint (author's final draft

    Late allocation and early release of physical registers

    Get PDF
    The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and releasing are conservatively done, the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once registers are not used any more. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers having a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where the register file pressure is usually high.Peer ReviewedPostprint (published version

    Concertina: Squeezing in cache content to operate at near-threshold voltage

    Get PDF
    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Scaling supply voltage to values near the threshold voltage allows a dramatic decrease in the power consumption of processors; however, the lower the voltage, the higher the sensitivity to process variation, and, hence, the lower the reliability. Large SRAM structures, like the last-level cache (LLC), are extremely vulnerable to process variation because they are aggressively sized to satisfy high density requirements. In this paper, we propose Concertina, an LLC designed to enable reliable operation at low voltages with conventional SRAM cells. Based on the observation that for many applications the LLC contains large amounts of null data, Concertina compresses cache blocks in order that they can be allocated to cache entries with faulty cells, enabling use of 100 percent of the LLC capacity. To distribute blocks among cache entries, Concertina implements a compression- and fault-aware insertion/replacement policy that reduces the LLC miss rate. Concertina reaches the performance of an ideal system implementing an LLC that does not suffer from parameter variation with a modest storage overhead. Specifically, performance degrades by less than 2 percent, even when using small SRAM cells, which implies over 90 percent of cache entries having defective cells, and this represents a notable improvement on previously proposed techniques.Peer ReviewedPostprint (author's final draft

    Virtual registers

    Get PDF
    The number of physical registers is one of the critical issues of current superscalar out-of-order processors. Conventional architectures allocate, in the decoding stage, a new storage location (e.g. a physical register) for each operation that has a destination register. When an instruction is committed, it frees the physical register allocated to the previous instruction that had the same destination logical register. Thus, an additional register (i.e. in addition to the number of logical registers) is used for each instruction with a destination register from the time it is decoded until it is committed. In this paper, we propose a novel register organization that allocates physical registers when instructions complete their execution. In this way, the register pressure is significantly reduced, since the additional register is only used from the time execution completes until the instruction is committed. For some long-latency instructions (e.g. load with a cache miss) and for parts of the code with a small amount of parallelism, the savings could be very high. We have evaluated the new scheme for a superscalar processor and obtained a significant speedup.Peer ReviewedPostprint (published version

    Light NUCA: a proposal for bridging the inter-cache latency gap

    Get PDF
    To deal with the “memory wall” problem, microprocessors include large secondary on-chip caches. But as these caches enlarge, they originate a new latency gap between them and fast L1 caches (inter-cache latency gap). Recently, Non-Uniform Cache Architectures (NUCAs) have been proposed to sustain the size growth trend of secondary caches that is threatened by wire-delay problems. NUCAs are size-oriented, and they were not conceived to close the inter-cache latency gap. To tackle this problem, we propose Light NUCAs (L-NUCAs) leveraging on-chip wire density to interconnect small tiles through specialized networks, which convey packets with distributed and dynamic routing. Our design reduces the tile delay (cache access plus one-hop routing) to a single processor cycle and places cache lines at a finer-granularity than conventional caches reducing cache latency. Our evaluations show that in general, L-NUCA improves simultaneously performance, energy, and area when integrated into both conventional or D-NUCA hierarchies.Postprint (author’s final draft

    Revisiting LP-NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping

    Get PDF
    Cache working-set adaptation is key as embedded systems move to multiprocessor and Simultaneous Multithreaded Architectures (SMT) because interthread pollution harms system performance and battery life. Light-Power NUCA (LP-NUCA) is a working-set adaptive cache that depends on temporal-locality to save energy. This work identifies the sources of energy waste in LP-NUCAs: parallel access to the tag and data arrays of the tiles and low locality phases with useless block migration. To counteract both issues, we prove that switching to serial access reduces energy without harming performance and propose a machine learning Adaptive Drop Rate (ADR) controller that minimizes the amount of replacement and migration when locality is low. This work demonstrates that these techniques efficiently adapt the cache drop and access policies to save energy. They reduce LP-NUCA consumption 22.7% for 1SMT. With interthread cache contention in 2SMT, the savings rise to 29%. Versus a conventional organization, energy--delay improves 20.8% and 25% for 1- and 2SMT benchmarks, and, in 65% of the 2SMT mixes, gains are larger than 20%

    STT-RAM memory hierarchy designs aimed to performance, reliability and energy consumption

    Get PDF
    Current applications demand larger on-chip memory capacity since off-chip memory accesses be-come a bottleneck. However, if we want to achieve this by scaling down the transistor size of SRAM-based Last-Level Caches (LLCs) it may become prohibitive in terms of cost, area and en-ergy. Therefore, other technologies such as STT-RAM are becoming real alternatives to build the LLC in multicore systems. Although STT-RAM bitcells feature high density and low static power, they suffer from other trade-offs. On the one hand, STT-RAM writes are more expensive than STT-RAM reads and SRAM writes. In order to address this asymmetry, we will propose microarchitectural techniques to minimize the number of write operations on STT-RAM cells. On the other hand, reliability also plays an important role. STT-RAM cells suffer from three types of errors: write, read disturbance, and retention errors. Regarding this, we will suggest tech-niques to manage redundant information allowing error detection and information recovery.Postprint (published version

    Gestión de contenidos en caches operando a bajo voltaje

    Get PDF
    La eficiencia energética de las caches en chip puede mejorarse reduciendo su voltaje de alimentación (Vdd ). Sin embargo, este escalado de Vdd está limitado a una tensión Vddmin por debajo de la cual algunas celdas SRAM (Static Random Access Memory) puede que no operen de forma fiable. Block disabling (BD) es una técnica microarquitectónica que permite operar a tensiones muy bajas desactivando aquellas entradas que contienen alguna celda que no opera de forma fiable, aunque a cambio de reducir la capacidad efectiva de la cache. Se utiliza en caches de último nivel (LLC), donde el ahorro potencial es mayor. Sin embargo, para algunas aplicaciones, el incremento de consumo debido a los accesos a memoria fuera del chip no compensa el ahorro energético conseguido en la LLC. Este trabajo aprovecha recursos existentes en los multiprocesadores, como son la jerarqui´a de memoria en chip y el mecanismo de coherencia, para mejorar las prestaciones de BD. En concreto, proponemos explotar la redundancia natural existente en una jerarqui´a de cache inclusiva para mitigar la pérdida de rendimiento debida a la reducción en la capacidad de la LLC. También proponemos una nueva poli´tica de gestión de contenidos consciente de la existencia de entradas de cache defectuosas. Utilizando la información de reúso, el algoritmo de reemplazo asigna entradas de cache operativas a aquellos bloques con más probabilidad de ser referenciados. Las técnicas propuestas llegan a reducir el MPKI hasta en un 36.4 % respecto a block disabling, mejorando su rendimiento entre un 2 y un 13%.Peer ReviewedPostprint (author's final draft

    ReD: A reuse detector for content selection in exclusive shared last-level caches

    Get PDF
    The reference stream reaching a chip multiprocessor Shared Last-Level Cache (SLLC) shows poor temporal locality, making conventional cache management policies inefficient. Few proposals address this problem for exclusive caches. In this paper, we propose the Reuse Detector (ReD), a new content selection mechanism for exclusive hierarchies that leverages reuse locality at the SLLC, a property that states that blocks referenced more than once are more likely to be accessed in the near future. Being placed between each L2 private cache and the SLLC, ReD prevents the insertion of blocks without reuse into the SLLC. It is designed to overcome problems affecting similar recent mechanisms (low accuracy, reduced visibility window and detector thrashing). ReD improves performance over other state-of-the-art proposals (CHAR, Reuse Cache and EAF cache). Compared with the baseline system with no content selection, it reduces the SLLC miss rate (MPI) by 10.1% and increases harmonic IPC by 9.5%.Peer ReviewedPostprint (author's final draft
    corecore